In [1]:
    
import pandas as pd
    
In [2]:
    
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
    
In [3]:
    
def print_array(a):
    print("{} elements of type {}: {}".format(len(a), a.dtype.name, a))
    
a = np.array([1,2,3])
print_array(a)
    
    
In [4]:
    
a = np.array([1.5, 2.5, 3.5])
print_array(a)
    
    
In [5]:
    
a = np.array(['cześć', 'software', 'carpentry'])
print_array(a)
    
    
Single elements are retrieved by their integer index, counting from 0:
In [6]:
    
a = np.array([101, 102, 103, 104, 105])
print(a[1])
    
    
A sub-array of consecutive elements can be retrieved with a slice:
In [7]:
    
print(a[1:3])
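
A small sketch of how slices behave, using a separate array (the names `arr` and `sub` are only for illustration): the start index is included, the stop index is excluded, and a basic slice is a view onto the original data rather than a copy.

```python
import numpy as np

arr = np.array([101, 102, 103, 104, 105])

# A slice start:stop includes `start` but excludes `stop`.
sub = arr[1:3]
print(sub)        # the elements at positions 1 and 2

# Basic slices are views: writing through the slice also changes `arr`.
sub[0] = -1
print(arr[1])
```

Use `arr[1:3].copy()` if an independent array is needed.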
    
    
In [8]:
    
a = np.arange(12).reshape(3, 4)
print(a)
    
    
In [9]:
    
print(a[1, 2])
    
    
In [10]:
    
print(a[1:, 2:])
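
The two cells above index a two-dimensional array with one entry per axis, rows first and columns second. A short sketch of the same idea on an illustrative array called `grid`:

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, filled with 0..11

single = grid[1, 2]    # one element: row 1, column 2
block = grid[1:, 2:]   # rows 1 onward, columns 2 onward
column = grid[:, 0]    # every row, first column only

print(single)
print(block)
print(column)
```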
    
    
In [11]:
    
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s
    
    Out[11]:
In [12]:
    
s.index
    
    Out[12]:
You can access the underlying NumPy array representation with the .values attribute:
In [13]:
    
s.values
    
    Out[13]:
We can access series values via the index, just like for NumPy arrays:
In [14]:
    
s[0]
    
    Out[14]:
Unlike the NumPy array, though, this index can be something other than integers:
In [15]:
    
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
    
    Out[15]:
In [16]:
    
s2['c']
    
    Out[16]:
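
When the index holds labels, it can help to state explicitly whether a lookup is by label or by position. A minimal sketch using the `.loc` and `.iloc` accessors on an illustrative series called `letters`:

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = letters.loc['c']      # look up by index label
by_position = letters.iloc[2]    # look up by integer position

print(by_label, by_position)

# Membership tests check the index labels, not the values.
print('c' in letters)
```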
It's possible to construct a series directly from a Python dictionary. Let's first define the dictionary.
In [17]:
    
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
pop_dict['Germany']
    
    Out[17]:
Trying to access non-existing keys in a dictionary will produce an error:
In [18]:
    
# pop_dict['Poland']
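
The error in question is a `KeyError`. A minimal sketch of two common ways to deal with a missing key, using a small illustrative dictionary (`pop_example` is not part of the tutorial data):

```python
pop_example = {'Germany': 81.3, 'Belgium': 11.3}

# Option 1: catch the KeyError raised by the square-bracket lookup.
try:
    pop_example['Poland']
except KeyError as err:
    missing = err.args[0]
    print("missing key:", missing)

# Option 2: dict.get returns None (or a default you pass) instead of raising.
fallback = pop_example.get('Poland')
print(fallback)
```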
    
But we can add new keys easily:
In [19]:
    
pop_dict['Poland'] = 40
pop_dict
    
    Out[19]:
NumPy-style arithmetic operations won't work on a dictionary:
In [20]:
    
#pop_dict * 1000
    
Now we construct a Series object from the dictionary.
In [21]:
    
population = pd.Series(pop_dict)
population
    
    Out[21]:
We can index the populations like a dict as expected:
In [22]:
    
population['France']
    
    Out[22]:
but with the power of NumPy arrays:
In [23]:
    
population * 1000
    
    Out[23]:
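
To make the contrast with a plain dictionary concrete, here is a small sketch on illustrative data (`pop_small` is a two-entry stand-in for the dictionary above). The dictionary needs an explicit loop, while the Series applies the operation to every element and keeps the labels.

```python
import pandas as pd

pop_small = {'Belgium': 11.3, 'Netherlands': 16.9}

# With a plain dict, every value has to be scaled explicitly.
scaled_dict = {country: value * 1000 for country, value in pop_small.items()}

# With a Series, the multiplication is applied element-wise.
scaled_series = pd.Series(pop_small) * 1000

print(scaled_dict)
print(scaled_series)
```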
Many things we have seen for NumPy can also be used with pandas objects.
Slicing:
In [24]:
    
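# Sort the index first so the label slice covers Belgium through Germany alphabetically
population.sort_index()['Belgium':'Germany']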
    
    Out[24]:
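
Label-based slices differ from positional slices in one important way: both endpoints are included. A short sketch on an illustrative three-country series (`pop` and `ordered` are names chosen here), sorted first so the labels are in alphabetical order:

```python
import pandas as pd

pop = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3})

ordered = pop.sort_index()              # Belgium, France, Germany
subset = ordered['Belgium':'France']    # both endpoints are included

print(subset)
```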
A range of methods:
In [25]:
    
population.mean()
    
    Out[25]:
In [ ]:
    
    
Series containing prices of beverages:
Beer              5
Coffee            2.5
Orange Juice      5
Water             2
Wine              6
In [ ]:
    
    
One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.
Note that in the IPython notebook, the data frame will display in a rich HTML view:
In [28]:
    
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
    
    Out[28]:
In [29]:
    
countries.index
    
    Out[29]:
In [30]:
    
countries.columns
    
    Out[30]:
To check the data types of the different columns:
In [31]:
    
countries.dtypes
    
    Out[31]:
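
Each column has its own data type, inferred from the values it was built from. A minimal sketch on an illustrative two-row frame (`frame` is a name chosen here), checking the types with helpers from `pd.api.types`; the exact type shown for text columns depends on the pandas version.

```python
import pandas as pd

frame = pd.DataFrame({'country': ['Belgium', 'France'],
                      'population': [11.3, 64.3],
                      'area': [30510, 671308]})

print(frame.dtypes)

# Decimal values become a float column, whole numbers an integer column.
population_is_float = pd.api.types.is_float_dtype(frame['population'])
area_is_integer = pd.api.types.is_integer_dtype(frame['area'])
print(population_is_float, area_is_integer)
```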
An overview of that information can be given with the info() method:
In [32]:
    
countries.info()
    
    
A DataFrame also has a values attribute, which returns its NumPy representation:
In [33]:
    
countries.values
    
    Out[33]:
If we don't like what the index looks like, we can replace it by setting one of our columns as the index:
In [34]:
    
countries = countries.set_index('country')
countries
    
    Out[34]:
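
Note that `set_index` returns a new DataFrame, which is why the result is assigned back above. A small sketch on an illustrative frame, also showing `reset_index` as the reverse operation:

```python
import pandas as pd

frame = pd.DataFrame({'country': ['Belgium', 'France'],
                      'capital': ['Brussels', 'Paris']})

indexed = frame.set_index('country')   # new DataFrame; `frame` is unchanged
restored = indexed.reset_index()       # moves the index back into a column

# With the country as index, rows can be looked up by name.
capital_of_france = indexed.loc['France', 'capital']
print(capital_of_france)
```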
To access a Series representing a column in the data, use typical indexing syntax:
In [35]:
    
countries['area']
    
    Out[35]:
As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
Arithmetic is one example. Let's compute the population density of each country (the population is given in millions, hence the factor of 1,000,000):
In [36]:
    
countries['population']*1000000 / countries['area']
    
    Out[36]:
Adding a new column to the dataframe is very simple:
In [37]:
    
countries['density'] = countries['population']*1000000 / countries['area']
countries
    
    Out[37]:
We can also sort the rows by the values of a column, here by density in descending order:
In [38]:
    
countries.sort_values(by='density', ascending=False)
    
    Out[38]:
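
Sorting combines naturally with selecting the first few rows. A minimal sketch, using three of the countries from the table above in an illustrative frame, that takes the two densest countries with `head(2)`:

```python
import pandas as pd

frame = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                      'area': [30510, 671308, 357050]},
                     index=['Belgium', 'France', 'Germany'])
frame['density'] = frame['population'] * 1000000 / frame['area']

# sort_values returns a new, sorted DataFrame; head(2) keeps its first two rows.
top_two = frame.sort_values(by='density', ascending=False).head(2)
print(top_two)
```

The original `frame` keeps its row order, since `sort_values` does not modify it in place by default.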
A useful method is describe, which computes summary statistics for each numeric column:
In [39]:
    
countries.describe()
    
    Out[39]:
The plot method can be used to quickly visualize the data in different ways:
In [40]:
    
countries.plot()
    
    Out[40]:
    
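However, for this dataset the default plot does not say much, because the columns have very different scales. Plotting a single column as a bar chart is more informative: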
In [41]:
    
countries['population'].plot(kind='bar')
    
    Out[41]:
    
You can experiment with the kind keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin' (the last two work only on a DataFrame and require x and y columns).
In [ ]:
    
    
In [ ]:
    
    
A wide range of input/output formats are natively supported by pandas:
In [44]:
    
pd.read_csv
    
    Out[44]:
In [45]:
    
countries.to_csv
    
    Out[45]:
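
As a sketch of how the two functions fit together, the following round trip writes a small illustrative frame to CSV text and reads it back. An in-memory `io.StringIO` buffer stands in for a file on disk so the example leaves nothing behind.

```python
import io
import pandas as pd

frame = pd.DataFrame({'population': [11.3, 64.3]},
                     index=pd.Index(['Belgium', 'France'], name='country'))

buffer = io.StringIO()
frame.to_csv(buffer)     # the index is written as the first column
buffer.seek(0)           # rewind so the buffer can be read from the start

# index_col tells read_csv which column to use as the index again.
loaded = pd.read_csv(buffer, index_col='country')
print(loaded)
```

With a real file, a path such as `'countries.csv'` (a hypothetical name) would replace the buffer in both calls.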
© 2015, Stijn Van Hoey and Joris Van den Bossche (stijnvanhoey@gmail.com, jorisvandenbossche@gmail.com).
© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)
Licensed under CC BY 4.0 Creative Commons
This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).